Robust sound event detection in bioacoustic sensor networks
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs),
can record sounds of wildlife over long periods of time in scalable and
minimally invasive ways. Deriving per-species abundance estimates from these
sensors requires detection, classification, and quantification of animal
vocalizations as individual acoustic events. Yet, variability in ambient noise,
both over time and across sensors, hinders the reliability of current automated
systems for sound event detection (SED), such as convolutional neural networks
(CNNs) in the time-frequency domain. In this article, we develop, benchmark, and
combine several machine listening techniques to improve the generalizability of
SED models across heterogeneous acoustic environments. As a case study, we
consider the problem of detecting avian flight calls from a ten-hour recording
of nocturnal bird migration, recorded by a network of six ARUs in the presence
of heterogeneous background noise. Starting from a CNN yielding
state-of-the-art accuracy on this task, we introduce two noise adaptation
techniques, respectively integrating short-term (60 milliseconds) and long-term
(30 minutes) context. First, we apply per-channel energy normalization (PCEN)
in the time-frequency domain, which applies short-term automatic gain control
to every subband in the mel-frequency spectrogram. Second, we replace the
last dense layer in the network with a context-adaptive neural network (CA-NN)
layer. Combining them yields state-of-the-art results that are unmatched by
artificial data augmentation alone. We release a pre-trained version of our
best performing system under the name of BirdVoxDetect, a ready-to-use detector
of avian flight calls in field recordings.
Comment: 32 pages, in English. Submitted to PLOS ONE journal in February 2019; revised August 2019; published October 2019.
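To make the short-term adaptation step concrete, below is a minimal sketch of per-channel energy normalization applied to a mel-frequency spectrogram via librosa's pcen function; the file name and all parameter values are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: PCEN as short-term, per-subband automatic gain control.
# The file name and parameter values are illustrative assumptions.
import librosa

y, sr = librosa.load("field_recording.wav", sr=22050)

# Mel-frequency spectrogram (linear magnitude, not dB).
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                   hop_length=256, n_mels=128)

# PCEN: a low-pass filter tracks each subband's energy over time, and
# dividing by it acts as automatic gain control before root compression.
# time_constant=0.06 mirrors the ~60 ms short-term context above.
S_pcen = librosa.pcen(S * (2 ** 31), sr=sr, hop_length=256,
                      time_constant=0.06, gain=0.98, bias=2.0,
                      power=0.5)
```

Note that the pre-trained BirdVoxDetect release mentioned above is the ready-to-use path; this sketch only illustrates the PCEN front end.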
Melodic Transcription of Flamenco Singing from Monophonic and Polyphonic Music Recordings
We propose a method for the automatic transcription of flamenco singing from monophonic and
polyphonic music recordings. Our transcription system is based on estimating the fundamental frequency (f0)
of the singing voice, and follows an iterative strategy for note segmentation and labelling. The generated
transcriptions are used in the context of melodic similarity, style classification and pattern detection. In our
study, we discuss the difficulties found in transcribing flamenco singing and in evaluating the obtained
transcriptions, we analyze the influence of the different steps of the algorithm, and we state the main
limitations of our approach and discuss the challenges for future studies.
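As a rough illustration of the f0-then-segment idea (not the authors' iterative algorithm), the sketch below estimates the singing f0 with librosa's pYIN implementation and segments notes naively by quantizing to semitones; the file name and pitch range are assumptions.

```python
# Sketch: f0 estimation followed by naive note segmentation.
# NOT the paper's iterative method; file name, pitch range, and the
# semitone quantization are illustrative assumptions.
import numpy as np
import librosa

y, sr = librosa.load("flamenco_voice.wav", sr=22050)
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C6"), sr=sr)

# Quantize voiced frames to MIDI semitones, then merge consecutive
# frames sharing a pitch into (start_frame, end_frame, pitch) notes.
midi = np.round(librosa.hz_to_midi(f0))
notes, start, pitch = [], None, None
for i, (m, v) in enumerate(zip(midi, voiced)):
    if v and start is None:
        start, pitch = i, m
    elif start is not None and (not v or m != pitch):
        notes.append((start, i, int(pitch)))
        start, pitch = (i, m) if v else (None, None)
if start is not None:
    notes.append((start, len(midi), int(pitch)))
```

A real transcriber must also handle ornamentation and pitch instability, which is precisely why the paper's iterative segmentation and labelling strategy is needed.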
Filler Word Detection and Classification: A Dataset and Benchmark
Filler words such as "uh" or "um" are sounds or words people use to signal
they are pausing to think. Finding and removing filler words from recordings is
a common and tedious task in media editing. Automatically detecting and
classifying filler words could greatly aid in this task, but few studies have
been published on this problem. A key reason is the absence of a dataset with
annotated filler words for training and evaluation. In this work, we present a
novel speech dataset, PodcastFillers, with 35K annotated filler words and 50K
annotations of other sounds that commonly occur in podcasts such as breaths,
laughter, and word repetitions. We propose a pipeline that leverages voice
activity detection (VAD) and automatic speech recognition (ASR) to detect
filler candidates, and a classifier to distinguish between filler
word types. We evaluate our proposed pipeline on PodcastFillers, compare to
several baselines, and present a detailed ablation study. In particular, we
evaluate the importance of using ASR and how it compares to a
transcription-free approach resembling keyword spotting. We show that our
pipeline obtains state-of-the-art results, and that leveraging ASR strongly
outperforms a keyword spotting approach. We make PodcastFillers publicly
available, and hope our work serves as a benchmark for future research.
Comment: Submitted to Interspeech 2022.
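As described, filler candidates are spans that the VAD marks as speech but the ASR leaves untranscribed. A minimal sketch of that interval subtraction follows; the (start, end) tuple format, helper name, and duration threshold are assumptions, not the PodcastFillers pipeline code.

```python
# Sketch: filler candidates = VAD speech regions minus ASR word spans.
# Interval format, function name, and min_dur are illustrative assumptions.
def filler_candidates(vad_segments, asr_words, min_dur=0.1):
    """Return sub-intervals of VAD speech not covered by any ASR word."""
    candidates = []
    for seg_start, seg_end in vad_segments:
        cursor = seg_start
        for w_start, w_end in sorted(asr_words):
            if w_end <= cursor or w_start >= seg_end:
                continue  # word lies outside the remaining segment
            if w_start - cursor >= min_dur:
                candidates.append((cursor, w_start))
            cursor = max(cursor, w_end)
        if seg_end - cursor >= min_dur:
            candidates.append((cursor, seg_end))
    return candidates

# Example: speech from 0-3 s, transcribed words at 0.2-0.9 s and
# 1.8-2.6 s; the untranscribed spans 0.0-0.2, 0.9-1.8, and 2.6-3.0 s
# become candidates for the filler-word classifier.
print(filler_candidates([(0.0, 3.0)], [(0.2, 0.9), (1.8, 2.6)]))
```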
Efficient Spoken Language Recognition via Multilabel Classification
Spoken language recognition (SLR) is the task of automatically identifying
the language present in a speech signal. Existing SLR models are either too
computationally expensive or too large to run effectively on devices with
limited resources. For real-world deployment, a model should also gracefully
handle unseen languages outside of the target language set, yet prior work has
focused on closed-set classification where all input languages are known
a priori. In this paper, we address these two limitations: we explore efficient
model architectures for SLR based on convolutional networks, and propose a
multilabel training strategy to handle non-target languages at inference time.
Using the VoxLingua107 dataset, we show that our models obtain competitive
results while being orders of magnitude smaller and faster than current
state-of-the-art methods, and that our multilabel strategy is more robust to
unseen non-target languages compared to multiclass classification.
Comment: Accepted to Interspeech 2023.
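A minimal sketch of the multilabel idea, with independent per-language sigmoid scores and a rejection threshold instead of a softmax forced to choose one class, is given below; the toy dimensions, threshold, and linear head are assumptions, not the paper's convolutional architecture.

```python
# Sketch: multilabel SLR head with open-set rejection.
# Embedding size, threshold, and the linear head are illustrative
# assumptions; the paper's convolutional encoder is not reproduced.
import torch
import torch.nn as nn

n_languages, emb_dim = 107, 256          # e.g. the VoxLingua107 label set
head = nn.Linear(emb_dim, n_languages)   # one independent logit per language

# Training: per-class binary cross-entropy (multilabel), not softmax CE.
criterion = nn.BCEWithLogitsLoss()
emb = torch.randn(8, emb_dim)            # stand-in for encoder embeddings
targets = torch.zeros(8, n_languages)
targets[torch.arange(8), torch.randint(n_languages, (8,))] = 1.0
loss = criterion(head(emb), targets)

# Inference: sigmoid scores; if no language clears the threshold,
# flag the utterance as a non-target (unseen) language.
scores = torch.sigmoid(head(emb))
conf, pred = scores.max(dim=1)
pred[conf < 0.5] = -1                    # 0.5 is an assumed threshold; -1 = non-target
```

Because each sigmoid is independent, an utterance in an unseen language can score low on every class, which a softmax (whose outputs must sum to one) cannot express.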